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Abstract 



We propose a new, nonparametric method for multivariate regression subject to 
convexity or concavity constraints on the response function. Convexity constraints 
are common in economics, statistics, operations research, financial engineering and 
optimization, but there is currently no multivariate method that is computationally 
feasible for more than a few hundred observations. We introduce Convex Adaptive 
Partitioning (CAP), which creates a globally convex regression model from locally 
linear estimates fit on adaptively selected covariate partitions. CAP is computationally 
efficient, in stark contrast to current methods. The most popular method, the least 
squares estimator, has a computational complexity of 0(n 3 ). We show that CAP has a 
^ computational complexity of C(nlog(n) log(log(n))) and also give consistency results. 

t> CAP is applied to value function approximation for pricing American basket options 

^ with a large number of underlying assets. 

ON 

Key words: Nonparametric regression, shape constraint, convex regression, treed lin- 
ear model, adaptive partitioning 

o 

^ 1 Introduction 

^ Consider the regression model for x G X C W and y G K, 

& 

V = /o(x) + e, 



where f : M. p — > R and e is a mean random variable. In this paper, we study the situation 
where /o is subject to a convexity constraint. That is, 

A/ (xi) + (1 - A)/ (x 2 ) > /o(Axx + (1 - A)x 2 ), 

for every Xi,x 2 G X and A G (0, 1). Given the observations (xi,yi), . . . , (x n ,y n ), we would 
like to estimate /o subject to the convexity constraint; this is called the convex regression 
problem. Note that convex regression is easily extended to concave regression since a concave 
function is the negative of a convex function. 
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Convex regression problems occur in a variety of settings. Economic theory dictates 
that demand (Varian 1982), production (Varian 1984, Allon et al.||2007 ) and consumer pref- 
erence (Boyd & Vandenberghe 2004) functions are often concave. In financial engineering, 



stock option prices usually have convexity restrictions ( Ait-Sahalia &: Duarte|2003 ). Stochas- 
tic optimization problems, studied in operations research and reinforcement learning, have 



response surfaces (Lim 2010) or value-to-go functions that exhibit concavity in many set- 



tings, like resource allocation (Topaloglu fo Powell 2003, Powell 2007 Toriello et al. 2010) 



or stochastic control (Keshavarz et al. 2011). Similarly, efficient frontier methods like data 



envelopment analysis ( Kuosmanen fo Johnson||2010 ) include convexity constraints. In statis- 



tics, shape restrictions like log-concavity are useful in density estimation (Cule et al. 2010 



Cule fo Samworth 2010, Schuhmacher fo Dumbgen 2010). Finally, in optimization, convex 



approximations to posynomial constraints are valuable for geometric programming (Kim 



et al. 2004, Boyd et al. 2007, Magnani fo Boyd 2009). Although convex regression has been 



well-explored in the univariate setting, the literature remains underdeveloped in the multi- 
variate setting. Existing methods do not scale well to more than a few thousand observations 
or more than a handful of dimensions. 

In this paper, we introduce the first computationally efficient, theoretically sound multi- 
variate convex regression method, called Convex Adaptive Partitioning (CAP). It relies on 
an alternate definition of convexity, 

/o(xi) > / (x 2 ) + So(xi) T ( Xl - x 2 ), (1) 

for every x 1; x 2 E X, where go( x ) £ <9/o( x ) is a subgradient of f at x. Equation states 
that a convex function lies above all of its supporting hyperplanes, or subgradients tangent to 
fo. Moreover, with enough supporting hyperplanes, fo can be approximately reconstructed 
by taking the maximum over those hyperplanes. 

The CAP estimator is formed by adaptively partitioning a set of observations. Within 
each subset of the partition, we fit a linear model to approximate the subgradient of fo within 



a continuous, 



that subset. Given a partition with K subsets and linear models, (afc,/3, 
convex (concave) function is then generated by taking the maximum (minimum) over the 
hyperplanes by 



/„(x) = max a k + f3 k x. 
ke{i,...,K} 

The partition is refined by a twofold strategy. First, one of the subsets is split along a 
cardinal direction (say, x\ or £3) to grow K. Then, the hyperplanes themselves are used to 
refit the subsets. A piecewise linear function like /„ induces a partition; a subset is defined 
as the region where a particular hyperplane is dominant. The refitting step places the 
hyperplanes in closer alignment with the observations that generated them. This procedure 
is repeated until all subsets have a minimal number of observations. The CAP estimator is 
then created by selecting the value of K that balances fit with complexity using a generalized 



cross validation method (Golub et al. 1979, Friedman 1991) 



CAP has strong theoretical properties, both in terms of computational complexity and 
asymptotic properties. We show that CAP is consistent with respect to the metric and 
has a computational complexity of 0{p{p + l) 2 nlog(n) log(log(n))) flops. The most widely 
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implemented convex regression method, the least squares estimator, has only recently been 



shown to be consistent (Seijo & Sen 2011 
complexity of 0((p + l) 3 



Lim & Glynn 2011) and has a computational 
Despite a difference of almost 0(n 2 ) 



J n") flops. Despite a difference of almost 0{n ) runtime, the CAP 
estimator usually has better predictive error as well. Because of its dramatic reduction in 
runtime, CAP opens a new class of problems for study, namely moderate to large problems 
with convexity or concavity constraints. 

The rest of this paper is organized as follows. In Section |2j we review the literature 
on convex regression. In Section [3j we present the CAP algorithm. In Section [4], we give 
computational complexity results and conditions for consistency. In Section [5j we derive a 
generalized cross-validation method and give a fast approximation for the full CAP algorithm. 
In Section |6j we empirically test CAP on convex regression problems, including value function 
estimation for pricing American basket options. In Section [7| we discuss our results and give 
directions for future work. 



2 Literature Review 



The literature for nonparametric convex regression is dispersed over a variety of fields, includ- 
ing statistics, operations research, economics, numerical analysis and electrical engineering. 
There seems to be little communication between the fields, leading to the independent dis- 
covery of similar techniques. 

In the univariate setting, there are many computationally efficient algorithms for convex 
regression. These methods rely on the ordering implicit to the real line. Setting Xi-\ < Xi < 
x i+ i for i — 2, . . . , n, 



fo(xi) - / (zi_i) ^ fo(x l+1 ) - f (Xi) 



< 



2 n, 



X -; 



Xi-l 



(2) 



is a sufficient constraint for pointwise convexity. When f is differentiable, Equation (|2j) is 
equivalent to an increasing derivative function. 

Various methods have been used to solve the univariate convex regression problem. The 
least squares estimator (LSE) is the oldest and simplest method. It produces a piecewise 
linear estimator by solving a quadratic program with n — 2 linear constraints ( Hildreth||1954 
Although the LSE is completely free of tunable parameters 
particularly in the multivariate setting. 



Dent 1973) 



the estimator 
Consistency, rate 



is not smooth and can overfit 
of convergence, and asymptotic distribution of the LSE were shown by Hanson Pledger 
(1976), Mammen (1991) and Groeneboom et al. ( 2001[), res pectively. Algorithmic methods 
for solving the quadratic program were given in Wu (1982), Dykstra (1983) and Fraser & 



Massam (1989). 



Spline methods have also been popular. Meyer (2008) and Meyer et al. (2011) used 



convex-restricted splines with positive parameters in frequentist and Bayesian settings, re- 



spectively. Turlach (2005) and Shively et al. (2011) used unrestricted splines with restricted 



parameters, likewise, in frequentist and Bayesian settings. In other methods, Birke fc Dette 



(2007) used convexity constrained kernel regression; Chang et al. (2007) used a random 



Bernstein polynomial prior with constrained parameters; and Koushanfar et al. (2010) trans 
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formed the ordering problem into a combinatorial optimization problem which they solved 
with dynamic programming. 

Due to the constraint on the derivative of fo, univariate convex regression is quite sim- 
ilar to univariate isotonic regression. The latter has been studied extensively with many 



and Shively et al. (2009). 



approaches; for examples, see Brunk (1955), Hall & Huang (2001), Neelon & Dunson (2004) 



Unlike the univariate setting, convex functions in multiple dimensions cannot be repre- 
sented by a simple set of first order conditions and projection onto the set of convex functions 
becomes computationally intensive. As in the univariate case, the earliest and most popular 
regression method is the LSE, which directly projects a least squares estimator onto the 



cone of convex functions. It was introduced by Hildreth (1954) and Holloway (1979). The 



estimator is found by solving the quadratic program, 



mm 



(3) 



subject to ijj > & + g[ (x,- - x*), i,j = 1,. 



n. 



Here, iji and gj are the estimated values of /o( x «) an d the subgradient of f at x^, respectively. 
The estimator is piecewise linear, 



iLSEf 



max iji + 

i£{l,...,n} 



The characterization (Kuosmanen 2008) and consistency (Seijo & Sen 2011, Lim & Glynn 



2011) of the least squares problem have only recently been studied. The LSE quickly be- 



comes impractical due to its size: Equation ^ has n{n — 1) constraints. This results in 
a computational complexity of 0((p + l) 3 n 3 ), which becomes impractical after one to two 
thousand observations. 

While the LSE is widely studied across all fields, the remaining literature on multivariate 
convex regression is sparser and more dispersed than the univariate literature. One approach 
is to place a positive semi-definite restriction on the Hessian of the estimator. In the eco- 



nomics literature, Henderson & Parmeter (2009) used kernel smoothing with a restricted 



Hessian and found a solution with sequential quadratic programming. In electrical engineer- 



ing, Roy et al. (2007), and in a variational setting, Aguilera & Morin (2008) and Aguilera 



& Morin (2009), used semi-definite programming to search the space of functions with pos- 



itive semi-definite local Hessians. Although consistent in some cases (Aguilera & Morin 



2008, 2009), Hessian methods are computationally intensive and can be poorly conditioned 



in boundary regions. In another approach, Allon et al. (2007) proposed a method based on 



reformulating the maximum likelihood problem as one minimizing entropic distance, which 
can be solved as a linear program. However, like the original maximum likelihood problem, 
the transformed problem still has n 2 constraints and does not scale to more than a few 
thousand observations. 

Recently, multivariate convex regression methods have been proposed with a more tradi- 



tional statistics approach. Aguilera et al. (2011) proposed a two step smoothing and fitting 



process. First, the data were smoothed and functional estimates were generated over an 
e-net over the domain. Then the convex hull of the smoothed estimate was used as a convex 
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CAP 

(A) 



Least Squares Estimator 
(B) 




Figure 1: (A) The CAP estimator, and (B) the LSE fit to 500 observations drawn from 
y = x\ + x\ + e, where e ~ A^(0,0.25 2 ). The covariates were drawn from a 2 dimensional 
uniform distribution, Unif [— 1, l] 2 . The LSE was truncated at predicted values of 2.5 for 
display, although some predicted values reached as high as 4,800 on [—1, l] 2 . 



estimator. Again, although this method is consistent, it is sensitive to the choice of smooth- 
ing parameter and does not scale to more than a few dimensions. Hannah & Dunson (2011) 
proposed a Bayesian model that placed a prior over the set of all piecewise linear models. 
They were able to show adaptive rates of convergence, but the inference algorithm did not 
scale to more than a few thousand observations. 
In a more computational approach, 



Magnani & Boyd (2009) use an iterative fitting 



scheme. In this method, the data were divided into K random subsets and a linear model 
was fit within each subset; a convex function was generated by taking the maximum over 
these hyperplanes. The hyperplanes induce a new partition, which is then used to refit the 
function. This sequence was repeated until convergence. Despite relatively strong empirical 
performance, this method is sensitive to the initial partition and the choice of K. Moreover, 
it is not consistent and there are cases when the algorithm does not even converge. 



3 The CAP Algorithm 

As seen in much of the literature, a natural way to model a convex function fo is through the 
maximum of a set of hyperplanes. One example of this method is the least squares estimator, 
which fits every observation with its own hyperplane. This is computationally expensive and 
can result in overfitting, as shown in Figure [TJ Instead, we wish to model fo through only K 
hyperplanes. We do this by partitioning the covariate space and approximating the gradients 
within each region by hyperplanes generated by the least squares estimator. The covariate 
space partition and K are chosen through adaptive partitioning. 

Given a partition {Ax, . . . , A^} of X, an estimate of the gradient for each subset can be 
created by taking the least squares linear estimate based on all of the observations within 
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that region, 



{oik, 0k 



arg min ^ (vi-a- /3 T x;) 2 . 



A convex function / can be created by taking the maximum over (a^, 0k)k=v 

/(x) = max «fc + fiTx.. 
ke{l,...,K} 

Models adaptive partitioning models with linear leaves have been proposed before; see 



Chaudhuri et al. (1994 1995), Alexander & Grimshaw (1996), Nobel (1996), Dobra & Gehrke 



(2002), Gyorfi et al. (2002) and Potts & Sammut (2005) for examples. In most of these cases, 



the partition is created by adaptively refining an existing partition by dyadic splitting of one 
subset. The split is chosen in a way that minimizes local error within the subset. There are 
two problems with these partitioning methods that arise when a piecewise linear summation 
function, 



K 



/*( X ) = + 1 { x GA fc }, 



k=l 

is changed into a piecewise linear maximization function, like /. First, a split that minimizes 
local error does not necessarily minimize global error for /. This is fairly easy to remedy by 
considering splits based on minimizing global error. The second problem is more difficult: 
the gradients often act in areas over which they were not estimated. 

A piecewise linear maximization function, /, generates a new partition, {A' ± , . . . ,A' K }, 

by 

A' k = {xG^ : a k + /3jx > atj + /3jx, V j ^ k) . 

The partition {A%, . . . , Ak} is not necessarily the same as {A' x , . . . , A' K }. We can use this 
new partition to refit the hyperplanes and produce a significantly better estimate. Refitting 
hyperplanes in this manner can be viewed as a Gauss-Newton method for the non-linear 



least squares problem (Magnani & Boyd 2009), 



minimize 



Hi - max (a k + 01- 
ke{i,...,K} x 



Breiman 


(1993 


) and 


Mag- 



nani & Boyd (2009). However, repeated refitting may not converge to a stationary partition 
and is sensitive to the initial partition. 

Convex Adaptive Partitioning (CAP) uses adaptive partitioning with linear leaves to fit 
a convex function that is defined as the maximum over the set of leaves. The adaptive 
partitioning itself differs from previous methods in order to fit piecewise linear maximiza- 
tion functions. Partitions are refined in two steps. First, candidate splits are generated 
through dyadic splits of existing partitions. These are evaluated and the one that minimizes 
global error is greedily selected. Second, the new partition is then refit. Although simple, 
these rules, and refitting in particular, produce large gains over naive adaptive partitioning 
methods; empirical results are discussed in Section |6j 
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Most other adaptive partitioning methods use backfitting or pruning to select the tree 
or partition size. Due to the construction of the CAP estimator, we cannot locally prune 
and so instead we rely on model selection criteria. We derive a generalized cross-validation 



method for this setting that is used to select K. This is discussed in Section 5.1 



Since CAP shares many feature with existing adaptive partitioning algorithms, we are 
able to use many adaptive partitioning results to study CAP practically and theoretically. 



This is in stark contrast to the least squares estimator. Despite being introduced by Hildreth 



(1954), it has not been implemented in many practical settings ( Lim||2010 ) and has only very 
recently been shown to be consistent (Seijo Sz Sen 2011, Lim & Glynn 2011). 



3.1 The Algorithm 

We now introduce some notation required for Convex Adaptive Partitioning. When presented 
with data, a partition can be defined over the covariate space (denoted by {Ai, . . . , Ak}, with 
Ak Q X) or over the observation space (denoted by {C±, . . . , Ck}, with Ck C {1, . . . , n}). 
The observation partition is defined from the covariate partition, 

C k = {i : Xj G Ak] , k = 1,. . . ,K. 

CAP proposes and searches over a set of models, M 1; ...,M^. A model is defined 
by: 1) the covariate partition {A±, . . . , Ak}, 2) the corresponding observation partition, 
{Ci, . . . ,Ck}, and 3) the hyperplanes (acj, (3j)j =1 fit to those partitions. 

The CAP algorithm progressively refines the partition until each subset cannot be split 
without one subset having fewer than a minimal number of observations, n m in, where 

n min = min <^ — - — - 2(d + 1) \. 

[D log{n) J 

Here D is a log scaling factor, which acts to change the base of the log operator. This minimal 
number is chosen so that 1) there are enough observations to accurately fit a hyperplane, 
and 2) there is a lower bound on the growth rate for the number of observations in each 
subset — and an upper bound on the number of subsets. This is used to show consistency. 

We briefly outline the CAP algorithm below. Details are given in the following subsec- 
tions. 



Convex Adaptive Partitioning (CAP) 

1. Initialize. Set K = 1; place all observations into a single observation subset, G\ = 
{1, . . . , n}; A\ = X; this defines model M 1 . 

2. Split. Refine partition by splitting a subset. 

a. Generate candidate splits. Generate candidate model Mkji by 1) fixing a subset k, 
2) fixing a dimension j, 3) dyadically dividing the data in subset k and dimensions 
j according to knot ag. This is done for L knots, all p dimensions and K subsets. 
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b. Select split. Choose the model M^+i from the candidates that minimizes global 
mean squared error on the training set and satisfies min^ \C k \ > n m i n . Set K = 
K + l. 

3. Refit. Use the partition induced by the hyperplanes to generate model M k . Set 
M K = M K if for every subset C k in \ /;,.. \C k \ > n min . 

4. Stopping conditions. If for every subset C k in M k , \C k \ < 2n m i n , stop fitting and 
proceed to step |5j Otherwise, go to step |2j 

5. Select model size. Each model M k creates an estimator, 

/nfc(x) = max atj + pjx. 

Use generalized cross-validation on the estimators to select final model M* from {M k }^ =1 . 



3.2 Splitting Rules 

To split, we create a collection of candidate models by dyadically splitting a single subset. 
Since the best way to do this is not apparent, we create models for every subset and search 
along every cardinal direction by splitting the data along that direction. We create model 
Mkji by 1) fixing subset k £ {1, . . . , K}, and 2) fixing dimension j £ {1, . . . ,p}. Let x^ in 
be the minimum value and x%[ ax be the maximum value of the covariates in this subset and 
dimension, 

x min = minfcy : % £ C k }, x k ± x = max{xy : i £ C k }. 

Let < a\ < ■ ■ ■ < a,L < 1 be a set of evenly spaced knots that represent the proportion 
between x% in and x% ax . 

Use the weighted average b k ji = ^i^mm + (1 ~ a t) x max ^° S P^^ C k and A k in dimension 
j. Set 

Cfc = {i : z £ Cjfc, Xij < b kje } } C K +i = {i:i £ C k , > b kji }, 

A k = {x : x £ A k , Xj < b kj i}, A K+1 = {x : x £ A k , Xj > b kji }. 

These define new subset and covariate partitions, CW + i and Ax-k+i where C k < = C k i and 
C k > = Ck> for k' k. Fit hyperplanes (& k , fik) k =i in each of the subsets. The triplet of 
observation partition CW+i, covariate partition, Ai-k+i, and set of hyperplanes (a k , fik) k =i 
defines the model M k j£. This is done for k — 1, . . . , K, j — 1, . . . ,p and £ = 1, . . . , L. After 
all models are generated, set K = K + 1. 

We note that any models where min^ \C k \ < n min are discarded. If all models are dis- 
carded in one subset /dimension pair, we produce a model by splitting on the subset median 
in that dimension. 
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3.3 Split Selection 

We select the model Mkjt that gives the smallest global error. Let (a^ 3 ' , ' be the 
hyperplanes associated with Mkji and let 

/nfc( X ) = . maX . ^ + 
i£{l,...,K} 

be its estimator. We set the model Mk to be the one that minimizes global mean squared 
error, 

M kji : (k, J, t) = arg mm - (vi ~ /n£( x *)) j ■ 
Set f n K to be the minimal estimator. 

3.4 Refitting 

We refit by using the partition induced by the hyperplanes. Let {ol-l-k, Pi-.k) be the hyper- 
planes associated with M K . Refit the partitions by 

C k = {xj : a k + /3jxi > a s + /3j*i,j ^ k} 

for k = 1, . . . , K. The covariate partition, A\-_k is defined in a similar manner. Fit hyper- 
planes in each of those subsets. Let Mk be the model generated by the partition C\, . . . , Ck- 
Set Mk = Mk if \C k \ > n m i n for all k. 

3.5 Tunable Parameters 

CAP has two tunable parameters, L and D. L specifies the number of knots used when 
generating candidate models for a split. Its value is tied to the smoothness of fo and after a 
certain value, usually 5 to 10 for most functions, higher values of L offer little fitting gain. 

The parameter D is used to specify a minimum subset size, \C k \ > n/(Dlog(n)). Here 
D transforms the base of the logarithm from e into exp(l/D). We have found that D = 3 
(implying base ~ 1.4) is a good choice for most problems. 

Increases in either of these parameters increase the computational time. Sensitivity to 
these parameters, both in terms of predictive error and computational time, is empirically 
examined in Section [6.1.31 

4 Theoretical Properties of CAP 

In this section, we give the computational complexity for the CAP algorithm and conditions 
for consistency. Since CAP is similar to existing adaptive partitioning methods, we can 
leverage existing results to show consistency. 



9 




Figure 2: (A) Number of observations n (log scale) vs. runtime in minutes (log scale), and 
(B) number of observations n (log scale) vs. mean absolute error (linear scale) for the least 
squares estimator (LSE), and CAP. Here xeK 5 and y — (xi + .5x 2 + x 3 ) 2 — x 4 + .25xg + e, 
where e ~ N(0,1). The covariates are drawn from a 5 dimensional standard Gaussian 
distribution, N 5 (0,I). 



4.1 Computational Complexity 

Computational complexity describes the number of bit operations a computer must do to 
perform a routine, such as CAP. It is useful to determine small sample runtimes and how well 
routines will scale to larger problems. The computational complexity of the least squares 
estimator is unworkably high at 0((p + l) 3 n 3 ) flops to solve a problem with n observations in 

). The worst case computational complexity of CAP is 
much lower, at 0(p(p+ l) 2 n log(n) log(log(n))) flops when implemented as in Section[3] The 
most demanding part of the CAP algorithm is the linear regression; each one has complexity 
0((p+ l) 2 n). For iteration k of the algorithm, Lpk linear regressions are fit. This is done for 
k = 1, . . . , K, where K is bounded by D log(n). Putting this together we obtain the above 
complexity. 

To demonstrate how much these factors matter in practice, we empirically compare CAP, 
Fast CAP and LSE on a small problem, y = (xi + .5x2 + x^) 2 — 24 + .25xg + e, where x e M 5 , 
x ~ N 5 (0,I) and e ~ N(0, 1). The runtimes and mean absolute errors of each method are 
shown in Figure [2j 



p dimensions (Monteiro fc Adler||1989 



4.2 Consistency 

We now show consistency for the CAP algorithm. Consistency is shown in a similar manner 
to consistency for other adaptive partitioning models, like CART (Breiman et al. 1984), 



treed linear models (Chaudhuri et al. 1994) and other variants (Nobel 1996, Gyorfi et al. 
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2002). We take a two-step approach, first showing consistency for the mean function and 



first derivatives of a more traditional treed linear model based on CAP under the metric 
and then we use that to show consistency for the CAP estimator itself. 

Letting M* be the model for the CAP estimate after n observations, define the discon- 
tinuous piecewise linear estimate based on M* 



Kn 



/n( X ) = ("* + ^ X ) M^A k ) 



k=l 



where K n is the partition size, A\, . . . ,Ax n are the covariate partitions and (a^, ^k)k=i are 
the hyperplanes associated with M*. Likewise, let / n (x) be the CAP estimator based on 



M* 

/ n (x) = max a k + /?Jx. 
ke{i,...,K n y 

Each subset A k has an associated diameter, d nk , where 

d nk = sup \\xx -x 2 \\ 2 . 

Define the empirical covariate mean for subset k as x k = t^j X^eC* x *- For x « ^ define 



d nl ( X * - X fc) 



T 

i 

ec fe 



Note that (a^, (5 k ) = G k ^ i&Ck ^iVi whenever G k is nonsingular. 

Let xi, . . . ,x„ be i.i.d. random variables. We make the following assumptions: 

Al. X is compact and fo is Lipschitz continuous and continuously different iable on X with 
Lipschitz parameter (. 

A2. There is an a > such that E [ e a|y ~ /o(:r)l | X = x] is bounded on X. 

A3. Let Afc be the smallest eigenvalue of \C k \~ 1 G k and A n = mhifcAfc. Then X n remains 
bounded away from in probability as n — > oo. 

A4. The diameter of the partition rnaxt d~l — > in probability as n — > oo. 

Assumptions Al. and A2. place regularity conditions on fo and the noise distribution, 
respectively. Assumption A3, is a regularity condition on the covariate distribution to 
ensure the uniqueness of the linear estimates. Assumption A4. is a condition that can be 
included in the algorithm and checked along with the subset cardinality, \C k \. If X is given, 
it can be computed directly, otherwise it can be approximated using {xj : i G C k }. In some 
cases, such as when f is strongly convex, A4. will be satisfied without enforcement due to 
problem structure. 

To show consistency of /„ under the ioo metric, we first show consistency of /* and its 



derivatives under the ^ metric in Theorem 4.1 This is very close to Theorem 1 of Chaudhuri 



et al. ( 1994 ) for treed linear models, although we need to modify it to allow partitions with 



an arbitrarily large number of faces. 
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Theorem 4.1. Suppose that assumptions Al. through A4. hold. Then, 



max sup lojjt + /3jx — /o( x )| — > 0, 

k=l,...,K„ xgj4fe 

in probability as n — > oo. 



max sup ||& - V/oCxJII^ -> 

k=l,...,K n xeA 



The CAP algorithm is similar to the SUPPORT algorithm of Chaudhuri et al. (1994), 
except the refitting step of CAP allows partition subsets to be polyhedra with up to Dlog(n) 
faces. Theorem 4.1 is analogous to Theorem 1 of Chaudhuri et al. (1994); to prove our 
theorem, we modify parts of the proof in Chaudhuri et al. ( 1994 ) that rely on a fixed number 
of polyhedral faces. As such, we first need to modify Lemma 12.27 of Breiman et al. (1984). 



Lemma 4.2 (Modification of Lemma 12.27 of Breiman et al. (1984)). Suppose that A2. 
holds and that there exists a k n — > oo where fe n /log(n) — > oo. Then, for every compact set 
B in X and every e > and c > 0, 



lim n c P 



> e for k such that A k C B, \C^\ > k n D\og{n) 



Proof. To prove this Lemma, we only need to lift the restriction on the number of faces 
of the polyhedron A k from being bounded by a fixed M to Dlog(n). First, we note that 
\Ck\ > n m in implies that K n < Dlog(n). Following the proof in Breiman et al. (1984), we 
note that 



P 



iec fc 



> e for k such that A k C B, |C&| > A; n -Dlog(n) 



< 2~L+(Dlog(n))(p+2) n (Dlo g (n))(p+2)-e 2 k n /2C 

for a fixed constant C depending on assumption 6. Since k n J log(n) — >■ oo, the conclusion 
holds. □ 



With Lemma [472] , the proof of Theorem |4 . 1 1 follows directly from the arguments of |ChaucL 

extension to consistency for f n under the metric 



huri et al. (1994). 



Using the results from Theorem 



4.1 



is fairly simple; this is given in Theorem 4.3 



Theorem 4.3. Suppose that assumptions Al. through A4. hold. Then, 



sup 



/n(x) - / (x) 



->■ 



in probability as n — > oo. 

Proof. Fix e > 0; let dx be the diameter of X . Choose n such that 



max sup I offc + /3jx — /o( x ) I > 

k=l,...,K n xe A k 



P 



max sup Hft - V/ (x) 

k=l,...,K n xeAk 



(dx 
e 



'°° > Cd 



x 



<e/2, 
< e/2 
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Fix a 5 net over X such that at least one point of the net sits in A\. for each k = 1, . . . , K. 
Let n$ be the number of points in the net and let xf be a point. Then, 



pi sup 



/n(x) - /o(x) 



> e 



sup 



max a k + /3j*x - / (x) 

fe=l,...,K„ 



> e 



< P < max 

i=l,... ,n s 



< P < max 

i=l,...,n s 



max a k + /3 fc x i - / (xJ 
k=i,...,.K n 



> 



A' 



fc=l 



> 



< e. 



□ 



5 Model Selection and Modifications 



The terminal model produced by the CAP algorithm often overfits the data and is com- 
putationally more intensive than necessary. In this section, we derive a generalized cross- 
validation method to select the best model from all of those produced by CAP, Mi, . . . , Mk- 
We then propose an approximate algorithm, Fast CAP, that requires substantially less com- 
putation than the original algorithm. 



5.1 Generalized Cross- Validation for CAP 



Cross-validation is a method to assess the predictive performance of statistical models and 
is routinely used to choose tunable parameters. In this case, we would like to choose the 
cardinality of the partition, K. As a fast approximation to leave-one-out cross-validation, 
we use generalized cross-validation ( Golub et al.||1979 Friedman||1991 ). In a linear regression 
setting, 



i=l i=l \ 11 / i=l \ n J 



where Ha is the i th diagonal element of the hat matrix, X(X T X) 1 X T , /_j is the estimator 
conditioned on all of the data minus element i, and v is the degrees of freedom. 

A given model Mk is generated by a collection of linear models. A similar type approx- 
imation to leave-one-out cross-validation can be used to select the model size. The model 
M K is defined by C\, . . . ,Ck, the partition, and the hyperplanes (ak, (3k)k=n which were 
generated by the partition. Let (a k % \fi k ^fLi be the collection of hyperplanes generated 
when observation i is removed; notice that if i 6 C k , only (a^, (3k) changes. Let j-iK be the 
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estimator for model Mr- with observation i removed. Using the derivation in Equation Q, 

1 n ^ \ n ( T \ ^ 

-£(w -/-**(*)) =-E(^-^ a i } ^ ) +/3l" l)T x J j , 

— "I — 1 \ / 



n / '„.. _ ... _ «T 



1 / ^ - a k(i) - ^ w xA 

-Lh — > ( 5 ) 

where, in a slight abuse of notation, is the diagonal entry of the hat matrix for subset k 
corresponding to element i, and 



k(i) = arg max 



To select i^, we find the K that minimizes the right hand side of Equation ([5]). Although more 
computationally intensive than traditional generalized cross-validation, the computational 
complexity for CAP generalized cross-validation is similar to that of the CAP split selection 
step. 



5.2 Fast CAP 

The CAP algorithm offers two main computational bottlenecks. First, it searches over all 
cardinal directions, and only cardinal directions, to produce candidate models. Second, 
it keeps generating models until no subsets can be split without one having less than the 
minimum number of observations. In most cases, the optimal number of components is much 
lower than the terminal number of components. 

To alleviate the first problem, we suggest using P' random projections as a basis for 
search. Using ideas similar to compressive sensing, each projection gj ~ N p (0,I) for j = 
1, . . . , P' . Then we search along the direction gjx rather than Xj. When we expect the true 
function to live in a lower dimensional space, as is the case with superfluous covariates, we 
can set P' < p. 

We solve the second problem by modifying the stopping rule. Instead of fulling growing 
the tree until each subset has less than 2n/(21og(n)) observations, we use generalized cross- 
validation. We grow the tree until the generalized cross-validation value has increased in 
two consecutive iterations or each subset has less than 2n/(21og(n)) observations. As the 
generalized cross-validation error is usually concave in K, this heuristic often offers a good 
fit at a fraction of the computational expense of the full CAP algorithm. 

The Fast CAP algorithm has the potential to substantially reduce the log(n) log(log(n)) 
factor by halting the model generation long before K reaches Dlog(n). Since every feasible 
partition is searched for splitting, the computational complexity grows as k gets larger. 

The Fast CAP algorithm is summarized as follows. 
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Fast Convex Adaptive Partitioning (Fast CAP) 



1. Initialize. As in CAP. 

2. Split. 

a. Generate candidate splits. Generate candidate model Mkji by 1) fixing a subset k, 
2) generating a random direction j with gj ~ N p (0, /), and 3) dyadically dividing 
the data as follows: 

• set xH in = min{gjxi : i G C k }, x^ ax = max{gjx; : i G C k } and b kje = 

a £ X min (1 ~~ a t) X max 

• set 

C k = {i : i G C k , gjxi < b kje }, C K +i = {i : i G C k , gjx^ > b kje }, 
A k = {x : x G A k , gjx < 6 fc ^}, = {x : x E A k , gjx > 6 fci£ }. 

Then new hyperplanes are fit to each of the new subsets. This is done for L knots, 
P' dimensions and K subsets. 

b. Select split. As in CAP. 

3. Refit. As in CAP. 

4. Stopping conditions. Let GCV(Mk) be the generalized cross-validation error for 
model M K . Stop if GCV(M K ) > GCV(M K ^) and GCV(M K ^) > GCV(M K „ 2 ). 
Then select final model as in CAP. 



6 Applications 

In this section, we empirically analyze the performance of CAP. There are no benchmark 
problems for multivariate convex regression, so we analyze the predictive performance, run- 
time, sensitivity to tunable parameters and rates of convergence on a set of synthetic prob- 
lems. We then apply CAP to value function approximation for pricing American basket 
options. 



6.1 Synthetic Regression Problems 

We apply CAP to two synthetic regression problems to demonstrate predictive performance 
and analyze sensitivity to tunable parameters. The first problem has a non-additive struc- 
ture, high levels of covariate interaction and moderate noise, while the second has a simple 
univariate structure embedded in a higher dimensional space and low noise. Low noise or 
noise free problems often occur when a highly complicated convex function needs to be 
approximated by a simpler one (Magnani & Boyd 2009). 
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Problem 1 Here x6l 5 . Set 

y — (xx + .5x 2 + X3) 2 - %a + -25x5 + e, 

where e ~ N(0, 1). The covariates are drawn from a 5 dimensional standard Gaussian 
distribution, N 5 (0,I). 

Problem 2 Here x G IR 10 . Set 

y = exp (x T p) + e, 

where p was randomly drawn from a Dirichlet(l,. . . ,1) distribution, 

p = (0.0680, 0.0160, 0.1707, 0.1513, 0.1790, 0.2097, 0.0548, 0.0337, 0.0377, 0.0791) T . 

We set e ~ iV(0,0.1 2 ). The covariates are drawn from a 10 dimensional standard Gaussian 
distribution, iVio(0, /). 

6.1.1 Predictive Performance and Runtimes 

We compared the performance of CAP and Fast CAP to other regression methods on Prob- 
lems 1 and 2. The only other convex regression method included was least squares regression 
(LSE); it was implemented with the cvx convex optimization solver. The general methods 



included Gaussian processes (Rasmussen & Williams 2006), a widely implemented Bayesian 



nonparametric method, and two adaptive methods: tree regression with constant values in 



the leaves and Multivariate Adaptive Regression Splines (MARS) (Friedman 1991). Tree 
regression was run through the Matlab function classregtree. MARS was run through the 
Matlab package ARESLab. Gaussian processes were run with the Matlab package gpml. 

Parameters for CAP and Fast CAP were set as follows. The log scale parameter set as 
D = 3 and the number of knots was set as L = 10 for both. In Fast CAP, the number of 
random search directions was set to be p. 

All methods were given a maximum runtime of 90 minutes, after which the results were 
discarded. Methods were run on 10 random training sets and tested on the same testing set. 
Average runtimes and predictive performance are given in Table [TJ 

Unsurprisingly, the non-convex regression methods did poorly compared to CAP and 
Fast CAP, particularly in the higher noise setting. Gaussian processes offered the best 
performance of that group, but their computational complexity scales like 0(n 3 ); this com- 
putational times of more than 90 minutes for n — 10, 000. More surprisingly, however, the 
LSE did extremely poorly. This can be attributed to overfitting, particularly in the bound- 
ary regions; this phenomenon can be seen in Figure [T] as well. While the natural response to 
overfitting is to apply a regularization penalty to the hyperplane parameters, implementation 
in this setting is not straightforward. We have tried implementing l 2 penalties on the hy- 
perplane coefficients, but tuning the parameters quickly became computationally infeasible 
due to runtime issues with the LSE. 

Although CAP and Fast CAP had similar predictive performance, their runtimes often 
differed by an order of magnitude with the largest differences on the biggest problem sizes. 
Based on this performance, we would suggest using Fast CAP on larger problems rather 
than the full CAP algorithm. 
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Mean Squared Error 
Problem 1 



TV f „i 1 1 

Met nod 


n = 100 


n = 200 


n = 500 


n = 1,000 


O AAA 

n = 2, 000 


C AAA 

n = 5, 000 


1 A AAA 

n = 10, 000 


CAP 


1.5884 


0.6827 


0.2740 


0.1644 


0.0927 


0.0629 


0.0450 


Fast CAP 


1.8661 


0.7471 


0.3197 


0.1526 


0.1356 


0.0724 


0.0566 


LSE 


15.8340 


9.5970 


18.0701 


9,862.4602 








Tree 


12.2794 


9.8356 


6.7606 


5.3478 


4.1230 


2.9173 


2.3152 


CP 


8.5056 


13.5495 


6.8472 


3.7610 


2.2928 


1.2058 




MARS 


8.3517 


8.0031 


6.8813 


6.2618 


5.9809 


5.8558 


5.8234 








Problem 2 








Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


CAP 


0.0159 


0.0138 


0.0110 


0.0018 


0.0012 


0.0007 


0.0003 


Fast CAP 


0.0159 


0.0138 


0.0090 


0.0018 


0.0011 


0.0007 


0.0003 


LSE 


0.6286 


0.2935 


31.2426 










Tree 


0.1372 


0.1129 


0.0928 


0.0797 


0.0670 


0.0552 


0.0495 


GP 


0.0109 


0.0063 


0.0039 


0.0027 


0.0047 


0.0076 




MARS 


0.0205 


0.0140 


0.0120 


0.0110 


0.0105 


0.0102 


0.0100 



Run Time 
Problem 1 



Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


CAP 


0.15 sec 


0.24 sec 


0.78 sec 


1.34 sec 


2.18 sec 


4.33 sec 


9.31 sec 


Fast CAP 


0.04 sec 


0.07 sec 


0.15 sec 


0.30 sec 


0.57 sec 


1.14 sec 


2.06 sec 


LSE 


1.56 sec 


10.17 sec 


226.20 sec 


43.37 min 








Tree 


0.06 sec 


0.02 sec 


0.04 sec 


0.09 sec 


0.19 sec 


0.49 sec 


1.15 sec 


GP 


0.22 sec 


0.35 sec 


1.35 sec 


5.07 sec 


22.03 sec 


248.72 sec 




MARS 


0.22 sec 


0.34 sec 


0.76 sec 


1.81 sec 


3.95 sec 


16.65 sec 


56.19 sec 








Problem 2 








Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


CAP 


0.05 sec 


0.25 sec 


2.15 sec 


6.35 sec 


10.06 sec 


21.06 sec 


46.50 sec 


Fast CAP 


0.02 sec 


0.03 sec 


0.08 sec 


0.13 sec 


0.25 sec 


0.89 sec 


2.03 sec 


LSE 


1.86 sec 


15.13 sec 


339.16 sec 










Tree 


0.02 sec 


0.03 sec 


0.07 sec 


0.14 sec 


0.27 sec 


0.71 sec 


1.53 sec 


GP 


0.15 sec 


0.34 sec 


1.46 sec 


4.93 sec 


23.13 sec 


264.77 sec 




MARS 


0.72 sec 


0.48 sec 


1.38 sec 


3.43 sec 


8.01 sec 


33.29 sec 


98.75 sec 



Table 1: CAP, Fast CAP, Least Squares Estimator (LSE), Gaussian Processes (GP), MARS 
and tree regression were run on Problems 1 and 2. Errors are in distance to the true mean 
function. The lowest error values are bolded. 
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6.1.2 CAP and Treed Linear Models 



Treed linear models are a popular method for regression and classification. They can be 
easily modified to produce a convex regression estimator by taking the maximum over all 
of the linear models. CAP differs from existing treed linear models in how the partition 
is refined. First, subset splits are selected based on global reduction of error. Second, the 
partition is refit after a split is made. To investigate the contributions of each step, we 
compare to treed linear models generated by: 1) local error reduction as an objective for 
split selection and no refitting, 2) global error reduction as an objective function for split 
selection and no refitting, and 3) local error reduction as an objective for split selection 
along with refitting. All estimators based on treed linear models are generated by taking the 
maximum over the set of linear models in the leaves. 

We wanted to determine which properties led to a low variance estimator with low pre- 
dictive error. By low variance, we mean that changes in the training set do not lead to large 
changes in predictive error. To do this, we compared the performance of these methods on 
Problems 1 and 2 over 10 different training sets and a single testing set. All treed linear 
model parameters were the same as those for CAP. We viewed a model with local subset split 
selection and no refitting as a baseline. We compared both the average squared predictive 
error and the variance of that error between training sets. Percentages of average error and 
variance reduction are displayed in Table [2] Average predictive error is displayed in Figure 

El 

Table [2] shows that global split selection and refitting are both beneficial, but in different 
ways. Refitting dramatically reduces predictive error, but can variance to the estimator 
in noisy settings. Global split selection modestly reduces predictive error but can reduce 
variance in noisy settings, like Problem 1. The combination of the two produces CAP, which 
has both low variance and high predictive accuracy. 

6.1.3 Sensitivity to Tunable Parameters 

In this subsection, we empirically examine the effects of the two tunable parameters, the 
log factor, D, and the number of knots, L. The log factor controls the minimal number of 
elements in each subset by setting |C&| > n/(D log(n)), and hence it controls the number 
of subsets, K, at least for large enough n. Increasing D allows the potential accuracy of 
the estimator to increase, but at the cost of greater computational time due to the increase 
in possible values for K and the larger number of possibly admissible sets generated in the 
splitting step of CAP. 

We compared values for D ranging from 0.1 to 20 on Problems 1 and 2 with sample 
sizes of n = 500 and n = 5, 000. Results are displayed in Figure |4| Note that error may 
not be strictly decreasing with D because different subsets are proposed under each value. 
Additionally, Fast CAP is a randomized algorithm so variance in error rate and runtime is 
to be expected. 

Empirically, once D > 1, there was little substantive error reduction in the models, but 
the runtime increased as 0(D 2 ) for the full CAP algorithm. Since D controls the maximum 
partition size, K n = D\og(n), and a linear regression is fit A^log(A) times, the expected 
increase in the runtime should only be 0(Dlog(D)). We believe that the extra empirical 
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Percentage Reduction in Mean Squared Error 
Problem 1 



Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


Refitting 


48.65% 


58.95% 


32.62% 


61.76% 


73.04% 


74.77% 


70.01 % 


Global Selection 


24.67% 


34.85% 


21.32% 


23.46% 


29.40% 


30.48% 


19.23% 


CAP 


68.25% 


69.81% 


74.74% 


76.97% 


80.18% 


81.40% 


81.04% 








Problem 2 








Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10, 000 


Refitting 


0.0% 


0.0% 


-17.73% 


71.48% 


78.36% 


79.67% 


77.05% 


Global Selection 


0.0% 


0.0% 


-4.36% 


17.69% 


15.22% 


25.04% 


9.74% 


CAP 


0.0% 


0.0% 


-17.10% 


71.70% 


75.60% 


81.66% 


86.21% 



Percentage Reduction in Variance of Mean Squared Error 
Problem 1 



Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


Refitting 


19.16% 


65.00% 


-243.33% 


-4.03% 


-163.40% 


64.86% 


-18.88% 


Global Selection 


38.41% 


68.78% 


-17.84% 


61.34% 


24.51% 


91.44% 


75.97% 


CAP 


96.89% 


92.72% 


68.74% 


97.05% 


74.85% 


95.29% 


63.17% 








Problem 2 








Method 


n = 100 


n = 200 


n = 500 


n = 1,000 


n = 2, 000 


n = 5,000 


n = 10,000 


Refitting 


0.0% 


0.0% 


-61.34% 


44.75% 


94.16% 


73.93% 


75.42% 


Global Selection 


0.0% 


0.0% 


-19.84% 


-223.58% 


-209.92% 


-8.29% 


-7.17% 


CAP 


0.0% 


0.0% 


-76.78% 


52.44% 


89.30% 


30.18% 


15.16% 



Table 2: Percentage reduction in mean squared error and variance of mean squared error 
compared to treed linear model with local split selection and no refitting on Problems 1 and 
2 over 10 training sets. This model was compared to 1) a treed linear model with partition 
refitting but local split selection, 2) a treed linear model with global split selection but no 
partition refitting, and 3) CAP. 
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Figure 3: Number of observations (log scale) vs. mean squared error (linear scale) for CAP 
and treed linear models with 1) local split selection, with no refitting, 2) local split selection, 
with refitting, and 3) global split selection, with no refitting. Mean error plus/minus one 
standard deviation are shown for data taken from 10 training sets. 
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Problem 1 



Problem 2 





Fast CAP, n=500 - © - CAP, n=5,000 - a - Fast CAP, n=5,000 



Figure 4: (Top) Log factor D (log scale) vs. mean absolute error (linear scale) for CAP 
and Fast CAP. (Bottom) Log factor D (log scale) vs. runtime in seconds (log scale). Both 
methods were run on Problem 1 (left) and Problem 2 (right) with n = 500 and n = 5, 000. 



growth comes from an increased number of feasible candidate splits. In the Fast CAP 
algorithm, which terminates after generalized cross-validation gains cease to be made, we 
see runtimes leveling off with higher values of D. Based on these results, we believe that 
setting D = 3 offers a good balance between fit and computational expense. 

The number of knots, L, determines how many possible subsets will be examined during 
the splitting step. Like D, an increase in L offers a better fit at the expense of increased 
computation. We compared values for D ranging from 1 to 50 on Problems 1 and 2 with 
sample sizes of n = 500 and n = 5,000. Results are displayed in Figure [5j 

The changes in fit and runtime are less dramatic with L than they are with D. After 
L = 3, the predictive error rates almost completely stabilized. Runtime increased as O(L) 
as expected. Due to the minimal increase in computation, we feel that L = 10 is a good 
choice for most settings. 



6.1.4 Empirical Rates of Convergence 

Although theoretical rates of convergence are not yet available for CAP, we are able to 
empirically examine them. Rates of convergence for multivariate convex regression have 
only been studied in two articles of which we are aware. First, Aguilera et al. (2011) studied 
rates of convergence for an estimator that is created by first smoothing the data, then 
evaluating the smoothed data over an e-net, and finally convexifying the net of smoothed 
data by taking the convex hull. They showed that the convexify step preserved the rates of 
the smoothing step. For most smoothing algorithms, these are minimax nonparametric rates, 
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Problem 1 




0.01 



10 



L10 1 



Problem 2 



0.08, 




100 



- - -<> 




|_10' 



■ CAP, n=500 



Fast CAP, n=500 - • - CAP, n=5,000 - a - Fast CAP, n=5,000 



Figure 5: (Top) Number of knots L (log scale) vs. mean absolute error (linear scale) for 
CAP and Fast CAP. (Bottom) Number of knots L (log scale) vs. runtime in seconds (log 
scale). Both methods were run on Problem 1 (left) and Problem 2 (right) with n = 500 and 
n = 5,000. 



Hannah & Dunson 



(2011) 



n~p+2 with respect to the empirical £2 norm. In the second article 
showed adaptive rates for a Bayesian model that places a prior over the set of all piecewise 
linear functions. Specifically, they showed that if the true mean function / actually maps a 
(i-dimensional linear subspace of X to R, that is 



fo( 



0o (x A), 



A G 



then their model achieves rates of log l (n)n d + 2 with respect to the empirical £2 norm. 
Empirically, we see these types of adaptive rates with CAP. 

In Figure [6j we plotted the number of observations against the square root of the mean 
squared error in a log-log plot for Problems 1 and 2. We then fitted a linear model for both 
CAP and Fast CAP. For Problem 1, p — 5 but d = 3, due to the sum in the quadratic 
term. Likewise, for Problem 2, p = 10 but d = 1 because it is an exponential of a linear 
combination. Under standard nonparametric rates, we would expect the slope of the linear 
model to be — | for Problem 1 and — ^ for Problem 2. However, we see slopes closer to 
— I and — \ for Problems 1 and 2, respectively; values are given in Table 3l These results 



strongly imply that CAP achieves adaptive convergence rates of the type shown by Hannah 
h Dunson (2011) for Problems 1 and 2. 
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Figure 6: Number of observations n (log scale) vs. square root of mean squared error (log 
scale) for (A) Problem 1 and (B) Problem 2. Linear models are fit to find the empirical rate 
of convergence. 



Method 


Problem 1 


Problem 2 


Expected: Rates in p 


-0.1429 


-0.0833 


Expected: Rates in d 


-0.2000 


-0.3333 


Empirical: CAP 


-0.2003 


-0.2919 


Empirical: Fast CAP 


-0.2234 


-0.2969 



Table 3: Slopes for linear models fit to log(ra) vs. \og(\/ M S E) in Figure^ Expected slopes 
are given when: 1) rates are with respect to full dimensionality, p, and 2) rates are with 
respect to dimensionality of linear subspace, d. Empirical slopes are fit to mean squared 
error generated by CAP and Fast CAP. Note that all empirical slopes are closest to those 
for linear subspace rates rather than those for full dimensionality rates. 
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6.2 Pricing Stock Options 



In sequential decision problems, a decision maker takes an action based on a currently 
observed state of the world based on the current rewards of that action and possible future 
rewards. Approximate dynamic programming is a modeling method for such problems based 
on approximating a value-to-go function. Value-to-go functions, or simply "value functions," 
give the value for each state of the world if all optimal decisions are made subsequently. 

Often value functions are known to be convex or concave in the state variable; this is 
common in options pricing, portfolio optimization and logistics problems. In some situa- 
tions, such as when a linear program is solved each time period to determine an action, 
a convex value function is required for computational tractability. Convex regression holds 
great promise for value function approximation in these problems. 

To give a simple example for value function approximation, we consider pricing American 
basket options on the average of N underlying assets. Options give the holder the right- 
but not the obligation — to buy the underlying asset, in this case the average of N individual 
assets, for a predetermined strike price K. In an American option, this can be done at 
any time between the issue date and the maturity date, T. However, American options are 
notoriously difficult to price, particularly when the underlying asset base is large. 

A popular method for pricing American options uses approximate dynamic program- 



ming where continuation values are approximated via regression (Carriere 1996, Tsitsiklis & 



Van Roy 1999, 2001, Longstaff & Schwartz 2001). We summarize these methods as follows; 
see Glasserman (2004) for a more thorough treatment. The underlying assets are assumed 
to have the sample path {X\, . . . , Xt}, where X t = {Si(t), . . . , Sjv(t)} is the set of securities 
at time t. At each time t, a continuation value function, Vt(X t ), is estimated by regressing a 
value function for the next time period, V t+ i(X t+ i) , on the current state, X t . The continua- 
tion value is the value of holding the option rather than exercising given the current state of 
the assets. The value function is defined to be the max of the current exercise value and the 
continuation value. Options are exercised when the current exercise value is greater than or 
equal to the continuation value. 



The procedure to estimate the continuation values is as follows (as summarized in Glasser- 



man 



(2004)): 



0. Define basket payoff function, 



h(X t ) = max 1 1 S k(T) - K, j . 



1. Sample M independent paths, {Xy, . . . , X T j}, j = 1, . . . , M. 

2. At time T, set V T (X Tj ) = h(X Tj ). 

3. Apply backwards induction: for t = T — 1,...,1, 

• given {t4+i(A t+1:; )}^ 1 , regress on {X t j}jL 1 to get continuation value estimates 
{C t {X tj )}f^. 
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set value function, 



V t (X tj ) = max {h(X tj ), C t (X tj )} 



We use the value function defined by Tsitsiklis & Van Roy (1999). 



The regression values are used to create a policy that is implemented on a test set: exercise 
when the current exercise value is greater than or equal to the estimated continuation value. 
A good regression model is crucial to creating a good policy. 

In previous literature, {C^X^)}^ has been estimated by regression splines for a single 



underlying asset (Carriere 



tions (Tsitsiklis & Van Roy 



1996), or least squares linear regression on a set of basis func- 



1999[ |Longstaff fc Schwartz|2001[ passermanpOOl) ) . Regression 



on a set of basis functions becomes problematic when X t j is defined over moderate to high 
dimensional spaces. Well-defined sets of bases such as radial basis functions and polynomials 
require an exponential number of functions to span the space, while manually selecting basis 
functions can be quite difficult. Since the expected continuation values are convex in the 
asset price for basket options, CAP is a simple, nonparametric alternative to these methods. 

We compared the following methods: CAP and Fast CAP with D = 3, L = 10 for 
both and P' = min(iV, 10); regression trees with constant leaves using the Matlab function 
classregtree; least squares using the polynomial basis functions 



;i, Si{t), s?(t), sf(t), smsm, h(x t )), i 



N, J ± i] 



ridge regression on the same basis functions with ridge parameter chosen by 10-fold cross- 
validation each time period from values between 10~ 3 and 10 5 . 

We compared value function regression methods as follows. We simulated on both n = 
10, 000 and n = 50, 000 training samples for a 3-month American basket option with a 
number of underlying assets, N, varying between 1 and 30. Sample paths differed between 
n = 10, 000 and n = 50, 000. All asset sample paths were generated by a geometric Brownian 
motion with a drift of 0.05 and a volatility of 0.10. All assets had correlation 0.5 and starting 
value 100. The option had strike price 110. Policy values were approximated on 50,000 
testing sample paths. An approximate upper bound was generated using the dual martingale 



methods of Haugh & Kogan (2004) from value functions generated using polynomial basis 
functions based on the mean of the assets, (1, Y, V 2 , V 3 , h(Y)), where Y t = l/N J2i=i x i(t), 
with 5,000 samples. Approximate duality gaps were generated using these values and the 
policy for each method. All values are in discounted dollars. All computations were run in 
Matlab on a 2.66 GHz Intel i7 processor. 

Results are displayed in Table [4} We found that CAP and Fast CAP gave state of the 
art performance without the difficulties associated with linear functions, such as choosing 
basis functions and regularization parameters. We observed a decline in the performance of 
least squares as the number of assets grew due to overfitting. Ridge regularization greatly 
improved the least squares performance as the number of assets grew. Tree regression did 
poorly in all settings, likely due to overfitting in the presence of the non-symmetric error 
distribution generated by the geometric Brownian motion. These results suggest that CAP 
is robust even in less than ideal conditions, such as when data have heteroscedastic, non- 
symmetric error distributions. 

Again, we noticed that while the performances of CAP and Fast CAP were comparable, 
the runtimes were about an order of magnitude different. On the larger problems, runtimes 
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Assets 


Method 


Policy Value 


Upper 


Duality Gap 


Time 
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Table 4: CAP, Fast CAP, tree regression with constant leaves, least squares (LS) and ridge 
regularized least squares were compared for pricing American basket options. Lower bounds 
were generated for each method by implementing the policy given by the value function. 
Computational times for each method are given in seconds. Approximate upper bounds 
were generated using Haugh & Kogan (2004). Duality gaps were calculated as a percentage 
of the approximate upper bound. The best lower bounds for each basket and sample size 
are bolded. 
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for Fast CAP were similar to those for unregularized least squares. This is likely because the 
number of covariates in the least squares regression grew like iV 2 , while all linear regressions 
in CAP only had N covariates. 

7 Discussion 

In this article, we presented Convex Adaptive Partitioning (CAP), a computationally effi- 
cient, theoretically sound and empirically robust method for regression subject to a convexity 
constraint. CAP is the first convex regression method to scale to large problems, both in 
terms of dimensions and number of observations. As such, we believe that it can allow 
the study of problems that were once thought to be computationally intractable. These in- 
clude econometrics problems, like estimating consumer preference or production functions in 
multiple dimensions, approximating complex constraint functions for convex optimization, 
or creating convex value-to-go functions or response surfaces that can be easily searched 
in stochastic optimization. Our preliminary results are encouraging, but some important 
questions remain unanswered. 

1. What are the convergence rates for CAP? Are they adaptive, as they empirically seem 
to be? 

2. The current splitting proposal is effective but cumbersome. Are there less computa- 
tionally intensive ways to refine the current partition? 

3. The modified stopping in Fast CAP provides substantially reduced runtimes with little 
performance degradation compared to CAP. Can this rule or a similarly efficient one 
be theoretically justified? 

We plan to explore this methodology further in the context of value function approxi- 
mation, particularly in the situations where the value functions are searched as part of an 
objective function. 
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